Change TerminatePhase return value #6335

knelli2 · 2024-10-16T16:55:36Z

Now TerminatePhase returns Parallel::AlgorithmExecution::Halt. Also added a new action PausePhase which returns
Parallel::AlgorithmExecution::Pause if that is needed.

This changes how the algorithm ends at the end of a phase.

Proposed changes

This is a fix for a bug in nodegroup parallization, but also a good thing to do. Once we get to the end of an PDAL (and we aren't looping), we shouldn't be allowed to restart as many phases are only intended to run once

Upgrade instructions

Code review checklist

The code is documented and the documentation renders correctly. Run
make doc to generate the documentation locally into BUILD_DIR/docs/html.
Then open index.html.
The code follows the stylistic and code quality guidelines listed in the
code review guide.
The PR lists upgrade instructions and is labeled bugfix or
new feature if appropriate.

Further comments

Now TerminatePhase returns Parallel::AlgorithmExecution::Halt. Also added a new action PausePhase which returns Parallel::AlgorithmExecution::Pause if that is needed. This changes how the algorithm ends at the end of a phase.

wthrowe

Not a full review, but I think you need to respond to @kidder's concerns from elsewhere, reproduced below:

Retry tells the distributed object to wait for more data before resuming the algorithm. The algorithm is still considered to be running on the distributed object. A subsequent call to receive_data will attempt to run the same action after putting data into the Inbox

Halt stops the algorithm. Nothing can restart the algorithm of a halted distributed object until the new phase is determined. This is used to execute a triggered phase change such as checkpointing or AMR

Pause stops the algorithm, but lets either something calling receive_data(instance, data, true) or perform_algorithm(true) (where the trues override defaulted false arguments) on the paused distributed object to restart the algorithm.

The member variable terminate_ is true if the algorithm is stopped. This happens for either Halt or Pause being called

The member variable halt_algorithm_until_next_phase_ is set true when Halt is called

There are currently two instances of perform_algorithm(true), one is in an example Geoffrey uses in his tutorial, the other is in the linear solver action BuildMatrix

So calling perform_algorithm() on a Paused or Halted object should not have done anything. So I am confused by the [diagnosis of a restart being caused by a reduction]

wthrowe

Review of code changes. Still need to be convinced this isn't just hiding some other bug.

wthrowe · 2024-10-16T19:44:12Z

tests/Unit/Time/Actions/Test_SelfStartActions.cpp

-                 step_actions>>;
+  // This test doesn't operate exactly like how SelfStart would normally work in
+  // an executable. Instead it jumps around quite a lot. Therefore to avoid any
+  // issues that TerminatePhase would cause, we just replace it with PausePhase.


Just remove it. No reason to replace it with a no-op.

wthrowe · 2024-10-16T19:44:51Z

tests/Unit/Evolution/Systems/Cce/Actions/Test_AnalyticBoundaryCommunication.cpp

@@ -209,10 +209,10 @@ SPECTRE_TEST_CASE(
          std::make_unique<::TimeSteppers::DormandPrince5>()));

  // Run the initializations
-  for (size_t i = 0; i < 6; ++i) {
+  for (size_t i = 0; i < 5; ++i) {


There are a ton of unexplained changes like this in this PR. What is going on?

There are 5 actions in the initialization phase for this component. Previously the last one returned Pause, so when it called ActionTesting::next_action one extra time, nothing bad happened. But now that the last action returns Halt, calling it an extra time causes an error. So it was a bug previously to call it 6 times. Same for the rest of the changes like this

Change TerminatePhase return value

cd7f036

Now TerminatePhase returns Parallel::AlgorithmExecution::Halt. Also added a new action PausePhase which returns Parallel::AlgorithmExecution::Pause if that is needed. This changes how the algorithm ends at the end of a phase.

nilsvu approved these changes Oct 16, 2024

View reviewed changes

nilsvu requested a review from nilsdeppe October 16, 2024 17:38

wthrowe requested changes Oct 16, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change TerminatePhase return value #6335

Change TerminatePhase return value #6335

knelli2 commented Oct 16, 2024

wthrowe left a comment

wthrowe left a comment

wthrowe Oct 16, 2024

wthrowe Oct 16, 2024

knelli2 Oct 16, 2024

Change TerminatePhase return value #6335

Are you sure you want to change the base?

Change TerminatePhase return value #6335

Conversation

knelli2 commented Oct 16, 2024

Proposed changes

Upgrade instructions

Code review checklist

Further comments

wthrowe left a comment

Choose a reason for hiding this comment

wthrowe left a comment

Choose a reason for hiding this comment

wthrowe Oct 16, 2024

Choose a reason for hiding this comment

wthrowe Oct 16, 2024

Choose a reason for hiding this comment

knelli2 Oct 16, 2024

Choose a reason for hiding this comment